A Parallel SGD method with Strong Convergence

Authors

  • Dhruv Kumar Mahajan
  • S. Sathiya Keerthi
  • S. Sundararajan
  • Léon Bottou
Abstract

This paper proposes a novel parallel stochastic gradient descent (SGD) method obtained by applying parallel sets of SGD iterations (each set operating on one node, using the data residing in it) to find the direction in each iteration of a batch descent method. The method has strong convergence properties. Experiments on datasets with high-dimensional feature spaces show the value of this method.

Introduction. We are interested in the large-scale learning of linear classifiers. Let {xi, yi} be the training set associated with a binary classification problem (yi ∈ {1, −1}). Consider a linear classification model, y = sgn(w · x). Let l(w · xi, yi) be a continuously differentiable, non-negative, convex loss function with a Lipschitz continuous gradient. This allows us to consider loss functions such as least squares, logistic loss and squared hinge loss. Hinge loss is not covered by our theory since it is non-differentiable. Our aim is to minimize the regularized risk functional f(w) = (λ/2)∥w∥² + L(w), where λ > 0 is the regularization constant and L(w) = Σi l(w · xi, yi) is the total loss. The gradient function g = ∇f is then Lipschitz continuous. For large-scale learning on a single machine, it is now well established that example-wise methods such as stochastic gradient descent (SGD) and its variations [1, 2, 3] and dual coordinate ascent [4] are much faster than batch gradient-based methods at reaching weights with the training optimality needed to attain steady-state generalization performance. However, example-wise methods are inherently sequential. For tackling problems involving huge data sets, a distributed solution becomes necessary. One approach to a parallel SGD solution [5] is via (iterative) parameter mixing [6, 7]. Consider a distributed setting with a master-slave architecture in which the examples are partitioned over P slave computing nodes.
Let Ip be the set of indices i such that (xi, yi) resides on the p-th node, and let Lp(w) = Σ_{i∈Ip} l(w · xi, yi) be the total loss associated with node p. Thus f(w) = (λ/2)∥w∥² + Σp Lp(w). Suppose the master node has the current weight vector w and communicates it to all the nodes. Each node p can form the approximation,
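The decomposition f(w) = (λ/2)∥w∥² + Σp Lp(w) above can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the logistic loss, the round-robin partition, and all function names are assumptions made here for concreteness.

```python
import numpy as np

def logistic_loss(margin):
    # l(w·x, y) with margin = y * (w·x); continuously differentiable and convex,
    # one of the losses the text says the theory covers.
    return np.log1p(np.exp(-margin))

def local_loss(w, X_p, y_p):
    # L_p(w): total loss over the examples residing on node p.
    return logistic_loss(y_p * (X_p @ w)).sum()

def f(w, partitions, lam):
    # f(w) = (λ/2)∥w∥² + Σ_p L_p(w), summing the per-node losses.
    return 0.5 * lam * (w @ w) + sum(local_loss(w, X_p, y_p)
                                     for X_p, y_p in partitions)

# Toy data: 100 examples, 5 features, partitioned over P = 4 "nodes"
# (here just array slices standing in for the slaves' local data).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = np.sign(X @ rng.standard_normal(5))
P = 4
partitions = [(X[p::P], y[p::P]) for p in range(P)]

w = np.zeros(5)
print(f(w, partitions, lam=0.1))  # at w = 0 every margin is 0, so f = 100*log(2)
```

Summing the Lp terms reproduces the single-machine objective exactly, which is what lets each node work on its own shard while the master reasons about the global f.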


Similar articles

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Stochastic gradient descent (SGD) and its variants have become more and more popular in machine learning due to their efficiency and effectiveness. To handle large-scale problems, researchers have recently proposed several parallel SGD methods for multicore systems. However, existing parallel SGD methods cannot achieve satisfactory performance in real applications. In this paper, we propose a f...



Projected Semi-Stochastic Gradient Descent Method with Mini-Batch Scheme under Weak Strong Convexity Assumption

We propose a projected semi-stochastic gradient descent method with mini-batch for improving both the theoretical complexity and practical performance of the general stochastic gradient descent method (SGD). We are able to prove linear convergence under weak strong convexity assumption. This requires no strong convexity assumption for minimizing the sum of smooth convex functions subject to a c...


Asynchronous Accelerated Stochastic Gradient Descent

Stochastic gradient descent (SGD) is a widely used optimization algorithm in machine learning. In order to accelerate the convergence of SGD, a few advanced techniques have been developed in recent years, including variance reduction, stochastic coordinate sampling, and Nesterov’s acceleration method. Furthermore, in order to improve the training speed and/or leverage larger-scale training data...


Distributed stochastic optimization for deep learning

We study the problem of how to distribute the training of large-scale deep learning models in the parallel computing environment. We propose a new distributed stochastic optimization method called Elastic Averaging SGD (EASGD). We analyze the convergence rate of the EASGD method in the synchronous scenario and compare its stability condition with the existing ADMM method in the round-robin sche...



Journal:
  • CoRR

Volume: abs/1311.0636

Pages: -

Publication date: 2013